Hand gesture detection method and apparatus, and computer storage medium

ABSTRACT

Provided are a hand gesture detection method and device, and a computer storage medium. The method includes: obtaining an initial depth image including a hand to be detected, and performing detection processing on the initial depth image by using a backbone feature extractor and a bounding box detection model to obtain initial bounding boxes and a first feature map corresponding to the hand to be detected; determining one of the initial bounding boxes as a target bounding box; cropping, based on the target bounding box, the first feature map by using an RoIAlign feature extractor, to obtain a second feature map corresponding to the hand to be detected; and performing, based on the second feature map, a three-dimensional gesture estimation processing on the hand to be detected by using a gesture estimation model to obtain a gesture detection result of the hand to be detected.

This application is a continuation of International Application No. PCT/CN2020/129258, filed on Nov. 17, 2020, which claims priority to an earlier U.S. provisional patent application No. 62/938,176, filed on Nov. 20, 2019 and entitled "CASCADED HAND DETECTION AND 3D HAND GESTURE ESTIMATION FOR A MOBILE TOF CAMERA". The disclosures of the aforementioned applications are hereby incorporated by reference in their entireties.

FIELD

The embodiments of the present disclosure relate to the technical field of image recognition, and in particular, to a hand gesture detection method, a hand gesture detection apparatus, and a computer storage medium.

BACKGROUND

The ability to accurately and efficiently reconstruct the motion of the human hand from images promises exciting new applications in immersive virtual and augmented realities, robotic control, and sign language recognition. There has been great progress in recent years, especially with the arrival of consumer depth cameras.

However, it remains a challenging task due to unconstrained global and local pose variations, frequent occlusion, local self-similarity, and a high degree of articulation.

SUMMARY

The present disclosure provides a hand gesture detection method, a hand gesture detection apparatus, and a computer storage medium, which can greatly improve the detection efficiency and accuracy of the hand gesture.

The technical solutions of the present disclosure can be realized as follows.

In a first aspect, an embodiment of the present disclosure provides a hand gesture detection method. The method includes: obtaining an initial depth image comprising a hand to be detected, and performing detection processing on the initial depth image by using a backbone feature extractor and a bounding box detection model, to obtain initial bounding boxes and a first feature map corresponding to the hand to be detected; determining a target bounding box based on the initial bounding boxes, the target bounding box being one of the initial bounding boxes; cropping, based on the target bounding box, the first feature map by using an RoIAlign feature extractor, to obtain a second feature map corresponding to the hand to be detected; and performing, based on the second feature map, a three-dimensional gesture estimation processing on the hand to be detected by using a gesture estimation model to obtain a gesture detection result of the hand to be detected.

In a second aspect, an embodiment of the present disclosure provides a hand gesture detection apparatus. The hand gesture detection apparatus includes: an obtaining component, a detection component, a determining component, a cropping component, and an estimation component. The obtaining component is configured to obtain an initial depth image including a hand to be detected. The detection component is configured to perform detection processing on the initial depth image by using a backbone feature extractor and a bounding box detection model to obtain initial bounding boxes and a first feature map corresponding to the hand to be detected. The determining component is configured to determine a target bounding box based on the initial bounding boxes, the target bounding box being one of the initial bounding boxes. The cropping component is configured to crop, based on the target bounding box, the first feature map by using an RoIAlign feature extractor to obtain a second feature map corresponding to the hand to be detected. The estimation component is configured to perform, based on the second feature map, a three-dimensional gesture estimation processing on the hand to be detected by using a gesture estimation model to obtain a gesture detection result of the hand to be detected.

In a third aspect, an embodiment of the present disclosure provides a hand gesture detection apparatus. The hand gesture detection apparatus includes a processor and a memory having instructions stored thereon and executable by the processor. The instructions, when executed by the processor, implement the above-described hand gesture detection method.

In a fourth aspect, an embodiment of the present disclosure provides a computer storage medium having a program stored thereon and applied to a hand gesture detection apparatus. The program, when executed by a processor, implements the above-mentioned hand gesture detection method.

In the hand gesture detection method provided by the embodiments of the present disclosure, the hand gesture detection apparatus obtains the initial depth image including a hand to be detected, and performs the detection processing on the initial depth image by using a backbone feature extractor and a bounding box detection model, to obtain initial bounding boxes and a first feature map corresponding to the hand to be detected; based on the initial bounding boxes, the hand gesture detection apparatus determines a target bounding box, which is one of the initial bounding boxes; based on the target bounding box, the hand gesture detection apparatus crops the first feature map using an RoIAlign feature extractor, to obtain the second feature map corresponding to the hand to be detected; based on the second feature map, the hand gesture detection apparatus performs three-dimensional gesture estimation processing on the hand to be detected using a gesture estimation model, to obtain a gesture detection result of the hand to be detected. In other words, in the embodiments of the present disclosure, when the hand gesture detection apparatus performs the hand gesture detection processing, it can combine the two tasks of the hand detection and the hand gesture estimation end-to-end. Specifically, the hand gesture detection apparatus can couple the output result of the hand detection with the input end of the hand gesture estimation through the RoIAlign feature extractor, and it can use the second feature map output by the RoIAlign feature extractor as the input of the gesture estimation model to complete the hand gesture detection. In view of the above, in the hand gesture detection method proposed in the embodiments of the present disclosure, the backbone feature extractor is only used to perform one feature extraction on the initial depth image, thereby achieving the joint processing of the hand detection and the hand gesture estimation. Therefore, the amount of computation can be greatly reduced, and the detection efficiency and accuracy of the hand gesture can be effectively improved.

BRIEF DESCRIPTION OF DRAWINGS

FIG. 1 is a schematic diagram of an image captured by a TOF camera provided in a related technical solution.

FIG. 2 is a schematic diagram of a detection result of a hand bounding box provided in a related technical solution.

FIG. 3 is a schematic diagram of key point positions of a hand skeleton in the related art.

FIG. 4 is a schematic diagram of a two-dimensional hand gesture estimation result in the related art.

FIG. 5 is a schematic flowchart of a conventional hand gesture detection according to a related technical solution.

FIG. 6 is a schematic diagram of a bilinear interpolation effect of RoIAlign in the related art.

FIG. 7 is a schematic diagram illustrating an effect of non-maximum suppression in the related art.

FIG. 8a and FIG. 8b are schematic diagrams of a union and an intersection in the related art.

FIG. 9 is a first schematic flowchart of a hand gesture detection method according to an embodiment of the present disclosure.

FIG. 10 is a second schematic flowchart of a hand gesture detection method according to an embodiment of the present disclosure.

FIG. 11 is a schematic diagram of an architecture of a hand gesture estimation method according to an embodiment of the present disclosure.

FIG. 12 is a third schematic flowchart of a hand gesture detection method according to an embodiment of the present disclosure.

FIG. 13 is a first schematic diagram of a structural composition of a hand gesture detection apparatus according to an embodiment of the present disclosure.

FIG. 14 is a second schematic diagram of a structural composition of a hand gesture detection apparatus according to an embodiment of the present disclosure.

DESCRIPTION OF EMBODIMENTS

In order to have a more detailed understanding of the features and technical contents of the embodiments of the present disclosure, the implementation of the embodiments of the present disclosure will be described in detail below with reference to the accompanying drawings. The accompanying drawings are only for reference and description, but are not intended to limit the embodiments of the present disclosure.

Hand gesture estimation mainly refers to accurately estimating the three-dimensional coordinate positions of human hand skeleton nodes from an image. It is a key issue in the fields of computer vision and human-computer interaction, and it is also of great significance in fields such as virtual reality, augmented reality, non-contact interaction, and gesture recognition. With the rise and development of commercial, inexpensive depth cameras, great progress has been made in hand gesture estimation.

Depth cameras include structured light cameras, laser scanning cameras, time of flight (TOF) cameras, etc. In most cases, TOF cameras are used.

Three-dimensional (3D) imaging based on the time-of-flight method continuously transmits light pulses to an object, uses a sensor to receive the light returned from the object, and obtains the distance to the object by detecting the flight (round-trip) time of the light pulses. Specifically, the TOF camera is a range imaging camera system that adopts the time-of-flight method to calculate, for each point in the image, the distance between the TOF camera and the captured object by measuring the round-trip time of an artificial light signal provided by a laser or a light-emitting diode (LED).

The TOF camera outputs an image with a size of H×W, and each pixel value on the two-dimensional (2D) image may represent a depth value of the pixel. The pixel value ranges from 0 to 3000 millimeters (mm). FIG. 1 is a schematic diagram of an image captured by a TOF camera provided in a related technical solution. In the embodiments of the present disclosure, the image captured by the TOF camera may be referred to as a depth image.

For example, the TOF camera provided by manufacturer OPPO differs from those made by other manufacturers in the following aspects: (1) it can be installed inside a smartphone instead of being fixed on a static bracket; (2) it has lower power consumption than those made by other manufacturers (such as Microsoft Kinect, Intel Realsense, etc.); and (3) it has lower image resolution, such as 240×180, while the typical value of image resolution is 640×480.

It can be understood that an input of the hand detection is a depth image, and an output thereof is a probability of hand presence (i.e., a number ranging from 0 to 1, a larger value indicating a greater confidence of the hand presence) and a hand bounding box (i.e., a bounding box representing a position and a size of the hand). FIG. 2 illustrates a schematic diagram of a detection result of a hand bounding box according to a related technical solution. As illustrated in FIG. 2, the black rectangle is the hand bounding box, and the score of the hand bounding box is as high as 0.999884.

In the embodiments of the present disclosure, the bounding box may be referred to as a boundary box. The bounding box can be expressed as (xmin, ymin, xmax, ymax), where (xmin, ymin) represents an upper left corner position of the bounding box, and (xmax, ymax) represents a lower right corner position of the bounding box.
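As a minimal Python sketch of this corner-based representation (the helper names and the example box values below are our own illustrations, not part of the disclosure), a box can be stored together with its confidence, and its width, height, and center derived from the corners:

    from dataclasses import dataclass

    @dataclass
    class BoundingBox:
        """Axis-aligned box in (xmin, ymin, xmax, ymax) corner format."""
        xmin: float
        ymin: float
        xmax: float
        ymax: float
        score: float = 0.0  # probability of hand presence, in [0, 1]

        @property
        def width(self) -> float:
            return self.xmax - self.xmin

        @property
        def height(self) -> float:
            return self.ymax - self.ymin

        @property
        def center(self) -> tuple:
            return ((self.xmin + self.xmax) / 2, (self.ymin + self.ymax) / 2)

    box = BoundingBox(xmin=50, ymin=40, xmax=170, ymax=160, score=0.999884)
    print(box.width, box.height, box.center)  # 120 120 (110.0, 100.0)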

Specifically, in a process of 2D hand gesture estimation, an input is a depth image, and an output is the 2D key point positions of a hand skeleton. An example of key point positions of a hand skeleton is illustrated in FIG. 3. FIG. 3 is a schematic diagram of key point positions of a hand skeleton in the related art. The hand skeleton can be set to have 20 key points, and the position of each key point is illustrated as 0 to 19 in FIG. 3. The position of each key point can be represented by 2D coordinate information (x, y), where x represents coordinate information in a direction of a horizontal image axis, and y represents coordinate information in a direction of a vertical image axis. Illustratively, after the coordinate information of the 20 key points is determined, a two-dimensional hand gesture detection result is illustrated in FIG. 4. FIG. 4 is a schematic diagram of a two-dimensional hand gesture estimation result in the related art.

In a process of 3D hand gesture estimation, an input is still a depth image, and an output is the 3D key point positions of a hand skeleton. An example of key point positions of a hand skeleton is illustrated in FIG. 3. The position of each key point can be represented by 3D coordinate information (x, y, z), where x represents coordinate information in a direction of a horizontal image axis, y represents coordinate information in a direction of a vertical image axis, and z represents coordinate information in a depth direction. The problems to be solved by the embodiments of the present disclosure are mainly related to the three-dimensional hand gesture estimation.

At present, a typical hand gesture detection process may include a hand detection part and a hand gesture estimation part. The hand detection part may include a backbone feature extractor and a bounding box detection head module, and the hand gesture estimation part may include a backbone feature extractor and a gesture estimation head module. For example, FIG. 5 is a schematic flowchart of a conventional hand gesture detection provided in a related technical solution. As illustrated in FIG. 5, after an initial depth image including a hand is obtained, the hand detection may be performed, that is, the backbone feature extractor and the bounding box detection head module included in the hand detection part are used to perform detection processing. At this time, the boundary of the bounding box can be adjusted, and then an image can be cropped by using the adjusted bounding box. The hand gesture estimation is performed on the cropped image, that is, the backbone feature extractor and the gesture estimation head module included in the hand gesture estimation part are used to perform gesture estimation processing. It should be noted that the tasks of the hand detection and the hand gesture estimation are completely separated. In order to couple these two tasks, the output position of the bounding box is adjusted to the mass center of pixels within the bounding box, and the bounding box is slightly enlarged to include all pixels of the hand. The initial depth image is cropped by using the adjusted bounding box. The cropped image is inputted into the hand gesture estimation. The backbone feature extractor is used twice to extract image features, which may lead to repeated computation, thereby resulting in a huge amount of computation.

In this case, RoIAlign may be introduced. RoIAlign is a regional feature aggregation method, which can well solve the problem of regional mismatch caused by the two quantization procedures in the RoI Pool operations. In a detection task, the accuracy of the detection result can be improved by replacing the RoI Pool with RoIAlign. That is, the RoIAlign layer removes the harsh quantization of the RoIPool and correctly aligns the extracted features with the input. Here, any quantization of RoI boundaries or bins can be avoided, e.g., x/16 can be used here instead of [x/16]. In addition, bilinear interpolation can also be used to calculate the exact values of the input features at four regularly sampled locations in each RoI bin, and the result is aggregated (using the maximum value or the average value). FIG. 6 is a schematic diagram of a bilinear interpolation effect of RoIAlign in the related art. In FIG. 6, the grid represents a feature map, the bold solid line represents an RoI (such as 2×2 bins), and 4 sampling points are dotted in each bin. RoIAlign uses the adjacent grid points on the feature map to perform bilinear interpolation calculations, so as to obtain the value of each sampling point. No quantization is performed on any coordinates involved in the RoI, the RoI bins, or the sampling points. It should also be noted that the detection results are not sensitive to the exact sampling locations or the number of sampling points, as long as no quantization is performed.
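To make the interpolation step concrete, the following minimal NumPy sketch illustrates the idea (it is our own illustration, not the disclosure's implementation): a feature map is evaluated at a non-integer sampling location by bilinear interpolation over the four neighboring grid points, with no coordinate quantization.

    import numpy as np

    def bilinear_sample(feature_map: np.ndarray, x: float, y: float) -> float:
        """Bilinearly interpolate feature_map (H x W) at continuous location (x, y)."""
        h, w = feature_map.shape
        x0, y0 = int(np.floor(x)), int(np.floor(y))
        x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
        dx, dy = x - x0, y - y0
        # Weighted sum of the four neighboring grid points.
        return (feature_map[y0, x0] * (1 - dx) * (1 - dy)
                + feature_map[y0, x1] * dx * (1 - dy)
                + feature_map[y1, x0] * (1 - dx) * dy
                + feature_map[y1, x1] * dx * dy)

    fmap = np.arange(16, dtype=np.float32).reshape(4, 4)
    print(bilinear_sample(fmap, 1.5, 2.25))  # 10.5, sampled at a fractional location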

In addition, Non-Maximum Suppression (NMS) has been widely used in several key aspects of computer vision and is a part of various detection methods such as edge, corner, or object detection. It is necessary because the imperfect ability of detection algorithms to localize the concept of interest results in several detection groups appearing in the vicinity of the real location.

In the context of object detection, the methods based on window sliding often produce multiple high-scoring windows, which are close to the correct location of the object. This is a consequence of the generalization ability of the object detector, the smoothness of the response function, and the visual correlation of close-by windows. The relatively dense output is often unsatisfactory for understanding the content of an image. In fact, in this step, the assumed number of windows is uncorrelated with the real number of objects in the image. Therefore, the goal of NMS is to retain only one window per detection group, corresponding to the precise local maximum of the response function, to optimally obtain only one detection per object. FIG. 7 is a schematic diagram of an effect of non-maximum suppression in the related art. A specific example of NMS is illustrated in FIG. 7, and the NMS aims to retain one window (the bold gray rectangle in FIG. 7).

FIG. 8a and FIG. 8b are schematic diagrams of a union and an intersection in the related art. In FIG. 8a and FIG. 8b, two bounding boxes are provided and denoted by BB1 and BB2, respectively. The black area in FIG. 8a is the intersection of BB1 and BB2, which is denoted by BB1∩BB2, i.e., the overlapping area of BB1 and BB2; and the black area in FIG. 8b is the union of BB1 and BB2, which is denoted by BB1∪BB2, i.e., the merged area of BB1 and BB2. Specifically, the calculation formula of the intersection-over-union ratio (represented by IoU) is presented by the following equation:

IoU=(Area of Overlap)/(Area of Union)=(BB1∩BB2)/(BB1∪BB2)  (1)
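A compact sketch of equation (1) and of the greedy NMS procedure it supports is given below. This is an illustrative implementation under common conventions, not code from the disclosure; the helper names iou and nms are ours.

    import numpy as np

    def iou(a, b):
        """IoU of two boxes in (xmin, ymin, xmax, ymax) format, per equation (1)."""
        ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))  # overlap width
        iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))  # overlap height
        inter = ix * iy
        area_a = (a[2] - a[0]) * (a[3] - a[1])
        area_b = (b[2] - b[0]) * (b[3] - b[1])
        return inter / (area_a + area_b - inter)

    def nms(boxes, scores, iou_threshold=0.5):
        """Greedy NMS: keep the highest-scoring box, drop heavily overlapping ones."""
        order = np.argsort(scores)[::-1]
        keep = []
        while len(order) > 0:
            best = order[0]
            keep.append(best)
            order = np.array([i for i in order[1:]
                              if iou(boxes[best], boxes[i]) <= iou_threshold])
        return keep  # indices of retained boxes, highest confidence first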

In the current hand gesture detection scheme, the hand detection and the hand gesture estimation are separated from each other in a typical hand gesture detection. In an offline training process, a hand detection model and a hand gesture estimation model are established, respectively, as two consecutive components in a pipeline. The training data of the hand gesture estimation may suffer a mismatch (compared to online inference), resulting in degraded online inference performance of the hand gesture estimation.

Meanwhile, the features previously computed by the hand detection component fail to be utilized by the current hand gesture estimation component. For each hand gesture estimation, it is required to extract the image features from the original image, which leads to a waste of computation and slower inference.

In order to solve the above two problems, these two tasks, i.e., the hand detection and the hand gesture estimation, are combined in the present disclosure. During the offline training, by connecting an output of the first task with an input of the second task, the two tasks are tightly coupled. That is, a bounding box result and the calculated image features of the hand detection task are directly inputted to the hand gesture estimation task. These two models are established through mutual influence, and thus benefit from task combination.

That is to say, in the embodiments of the present disclosure, the hand gesture detection apparatus, when performing the hand gesture detection processing, can combine the two tasks of the hand detection and the hand gesture estimation end-to-end. Specifically, the hand gesture detection apparatus can connect the output result of the hand detection to the input end of the hand gesture estimation through the RoIAlign feature extractor, and further, it can complete the hand gesture detection by using a second feature map output by the RoIAlign feature extractor as the input of the gesture estimation model. It can be seen that, in the hand gesture detection method proposed in the embodiments of the present disclosure, only the backbone feature extractor is used to perform one feature extraction on an initial depth image to achieve the joint processing of the hand detection and the hand gesture estimation, thereby greatly reducing the amount of computation and effectively improving the detection efficiency and accuracy of the hand gesture.
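The coupling can be pictured as a single forward pass in which the backbone runs once and RoIAlign bridges the two heads. The PyTorch sketch below is our own schematic of that data flow; the module names (backbone, box_head, pose_head) are placeholders and the disclosure does not prescribe these layers.

    import torch
    import torch.nn as nn
    from torchvision.ops import roi_align

    class CascadedHandNet(nn.Module):
        def __init__(self, backbone: nn.Module, box_head: nn.Module, pose_head: nn.Module):
            super().__init__()
            self.backbone = backbone    # run once per image
            self.box_head = box_head    # predicts boxes + confidences from the feature map
            self.pose_head = pose_head  # 3D gesture estimation from RoI features

        def forward(self, depth_image: torch.Tensor):
            feat = self.backbone(depth_image)             # first feature map
            boxes, scores = self.box_head(feat)           # initial bounding boxes
            target = boxes[scores.argmax()].unsqueeze(0)  # highest-confidence box
            # Crop the shared feature map instead of re-extracting from the image.
            batch_idx = torch.zeros((1, 1))
            rois = torch.cat([batch_idx, target], dim=1)  # (batch_index, x1, y1, x2, y2)
            roi_feat = roi_align(feat, rois, output_size=(8, 8))  # second feature map
            return self.pose_head(roi_feat)               # 3D key point estimate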

The embodiments of the present disclosure will be described in detail below with reference to the accompanying drawings.

An embodiment of the present disclosure provides a hand gesture detection method. FIG. 9 is a first schematic flowchart of a hand gesture detection method according to an embodiment of the present disclosure. As illustrated in FIG. 9, in the embodiment of the present disclosure, the hand gesture detection method, which is performed by a hand gesture detection apparatus, may include the following actions in blocks.

At block 101, an initial depth image including a hand to be detected is obtained, and detection processing is performed on the initial depth image by using a backbone feature extractor and a bounding box detection model, to obtain initial bounding boxes and a first feature map corresponding to the hand to be detected.

In the embodiments of the present disclosure, the hand gesture detection apparatus can obtain the initial depth image corresponding to the hand to be detected, and then can use the backbone feature extractor and the bounding box detection model to perform the detection processing on the initial depth image, so as to obtain the initial bounding boxes and the first feature map corresponding to the hand to be detected.

It should be noted that, in the present disclosure, the obtained initial depth image is a depth image of the hand to be detected, that is, the initial depth image includes the hand to be detected.

It should be noted that, in the embodiments of the present disclosure, the hand gesture detection method is applicable to the hand gesture detection apparatus, or an electronic apparatus integrated with the hand gesture detection apparatus. The electronic apparatus may be a smartphone, a tablet computer, a notebook computer, a handheld computer, a personal digital assistant (PDA), a navigation apparatus, a wearable apparatus, a desktop computer, etc., which are not limited to any of these examples in the embodiments of the present disclosure.

It can be understood that, for a human hand, there may be a plurality of key nodes, i.e., key points, in the skeleton of the hand. Usually, the hand includes at least 20 key points. The specific positions of the 20 key points on the hand are illustrated in FIG. 3.

Further, in the embodiments of the present disclosure, when the hand gesture detection apparatus uses the backbone feature extractor and the bounding box detection model to perform the detection processing on the initial depth image to obtain the first feature map and the initial bounding boxes, the hand gesture detection apparatus can first input the initial depth image into the backbone feature extractor, and then output the first feature map; thereafter, the hand gesture detection apparatus can obtain, based on the first feature map, the initial bounding boxes by using the bounding box detection model.

It should be noted that, in the embodiments of the present disclosure, when the hand gesture detection apparatus obtains, based on the first feature map, the initial bounding boxes by using the bounding box detection model, the hand gesture detection apparatus inputs the first feature map into the bounding box detection model, so as to output a plurality of bounding boxes and a plurality of confidences corresponding to the plurality of bounding boxes in one-to-one correspondence; and then, the hand gesture detection apparatus determines, based on the plurality of confidences, a part of the plurality of bounding boxes as the initial bounding boxes.

That is to say, in the present disclosure, when the hand gesture detection apparatus uses the bounding box detection model, the initial bounding boxes can be selected from the plurality of bounding boxes by using the confidences of the bounding boxes, that is, there may be multiple initial bounding boxes.

Further, in the embodiments of the present disclosure, when the hand gesture detection apparatus performs training of the bounding box detection model, the hand gesture detection apparatus can train a selection processing of the bounding boxes. For example, in the training process, the selection processing of a target bounding box may be selecting and outputting 32 optimal bounding boxes from all the bounding box detection results as the initial bounding boxes.

It can be understood that, in the embodiments of the present disclosure, the bounding box may be used to perform the hand detection processing on the initial depth image. That is, a position and size corresponding to the hand may be determined through the bounding box.

Further, in the embodiments of the present disclosure, the hand gesture detection apparatus uses the backbone feature extractor to perform feature extraction on the initial depth image, and the obtained feature image, which only includes the hand to be detected, is the first feature map.

At block 102, a target bounding box is determined based on the initial bounding boxes. The target bounding box is one of the initial bounding boxes.

In the embodiments of the present disclosure, after the hand gesture detection apparatus obtains the initial depth image and performs the detection processing on the initial depth image by using the backbone feature extractor and the bounding box detection model to obtain the initial bounding boxes and the first feature map corresponding to the hand to be detected, the hand gesture detection apparatus determines the target bounding box based on the initial bounding boxes.

It should be noted that, in the present disclosure, the target bounding box may be one of the initial bounding boxes. That is, the hand gesture detection apparatus can select one bounding box from the plurality of initial bounding boxes as the final target bounding box.

Further, in the embodiments of the present disclosure, the hand gesture detection apparatus may determine, as the target bounding box, an initial bounding box corresponding to a maximum confidence among the plurality of confidences.

That is to say, in the present disclosure, based on the one-to-one correspondence between the confidences and the initial bounding boxes, the hand gesture detection apparatus may perform a comparison processing on the confidences to determine the maximum confidence, so as to determine the initial bounding box corresponding to the maximum confidence as the target bounding box.

Further, in the embodiments of the present disclosure, when determining the target bounding box based on the initial bounding boxes, the hand gesture detection apparatus may first determine an intersection parameter between the initial bounding boxes. If the intersection parameter is greater than a predetermined intersection threshold, the hand gesture detection apparatus can perform down-sampling processing on the initial bounding boxes to obtain spare bounding boxes; and the hand gesture detection apparatus determines a spare bounding box corresponding to the maximum confidence among the confidences corresponding to the spare bounding boxes, as the target bounding box.

That is to say, in the present disclosure, when determining the target bounding box, the hand gesture detection apparatus may further reduce the number of bounding boxes to be selected by determining the intersection parameter between the initial bounding boxes. Then, the hand gesture detection apparatus selects the target bounding box from the spare bounding boxes obtained by the down-sampling processing. The target bounding box is the spare bounding box with the highest confidence among the spare bounding boxes.

It should be noted that, in the embodiments of the present disclosure, the predetermined intersection threshold is a specific value preset by the hand gesture detection apparatus and used for determining whether to perform the down-sampling processing. For example, the predetermined intersection threshold can be 0.5.

It can be understood that, in the embodiments of the present disclosure, after outputting the plurality of initial bounding boxes by using the bounding box detection model, the hand gesture detection apparatus can further select the spare bounding boxes based on the intersection between the initial bounding boxes. Specifically, the hand gesture detection apparatus can obtain the spare bounding boxes through the down-sampling processing. In the down-sampling process of the bounding boxes, if the intersection parameter of any two initial bounding boxes is greater than 0.5 (i.e., the predetermined intersection threshold), the hand gesture detection apparatus can perform the down-sampling processing on the initial bounding boxes to obtain the spare bounding boxes.
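One plausible reading of this down-sampling step is sketched below, reusing the iou helper from the sketch following equation (1); the grouping rule and the helper name are our assumptions for illustration, since the disclosure only fixes the 0.5 intersection threshold and the pick-by-confidence rule.

    def downsample_boxes(boxes, scores, inter_threshold=0.5):
        """Collapse heavily overlapping initial boxes into spare boxes.

        Whenever two boxes overlap by more than the threshold, only the
        higher-confidence one is kept as a spare bounding box.
        """
        order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
        spare = []
        for i in order:
            if all(iou(boxes[i], boxes[j]) <= inter_threshold for j in spare):
                spare.append(i)
        return spare  # indices of spare boxes; spare[0] has the maximum confidence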

It can be seen that, in the embodiments of the present disclosure, when determining the target bounding box, the hand gesture detection apparatus can either directly select one target bounding box from the plurality of initial bounding boxes based on the confidences corresponding to the initial bounding boxes; or perform the down-sampling processing on the plurality of initial bounding boxes first to obtain a smaller number of spare bounding boxes, and then determine the target bounding box based on the confidences corresponding to the spare bounding boxes.

At block 103, based on the target bounding box, the first feature map is cropped by using an RoIAlign feature extractor, to obtain a second feature map corresponding to the hand to be detected.

In the embodiments of the present disclosure, after the target bounding box is determined based on the initial bounding boxes, the hand gesture detection apparatus may crop, based on the target bounding box, the first feature map by using the RoIAlign feature extractor to obtain the second feature map corresponding to the hand to be detected.

Further, in the embodiments of the present disclosure, when the hand gesture detection apparatus crops, based on the target bounding box, the first feature map by using the RoIAlign feature extractor to obtain the second feature map corresponding to the hand to be detected, the hand gesture detection apparatus can input the target bounding box and the first feature map into the RoIAlign feature extractor, to output the second feature map.

It can be understood that, in the embodiments of the present disclosure, the RoIAlign feature extractor can be configured to perform a shallow feature extraction on the first feature map corresponding to the hand to be detected, which may specifically include a general outline and edge positions of the hand to be detected, so as to obtain an RoIAlign feature map corresponding to the hand to be detected, i.e., the second feature map of the hand to be detected.

Further, in the embodiments of the present disclosure, when the hand gesture detection apparatus crops, based on the target bounding box, the first feature map by using the RoIAlign feature extractor to obtain the second feature map corresponding to the hand to be detected, the hand gesture detection apparatus may first determine a cropping region based on the target bounding box, and then crop, based on the cropping region, the first feature map by using the RoIAlign feature extractor, to obtain the second feature map.

It can be understood that, in the embodiments of the present disclosure, the bounding box may be used to perform hand detection processing on the initial depth image, that is, the position and size corresponding to the hand may be determined by means of the bounding box. Therefore, the hand gesture detection apparatus can first determine the cropping region by using the target bounding box with the highest confidence, then crop the cropping region by using the RoIAlign feature extractor, and finally generate the second feature map.

At block 104, based on the second feature map, three-dimensional gesture estimation processing is performed on the hand to be detected by using the gesture estimation model to obtain a gesture detection result of the hand to be detected.

In the embodiments of the present disclosure, after the hand gesture detection apparatus crops, based on the target bounding box, the first feature map by using the RoIAlign feature extractor to obtain the second feature map corresponding to the hand to be detected, the hand gesture detection apparatus can perform, based on the second feature map, the three-dimensional gesture estimation processing on the hand to be detected by using the gesture estimation model to obtain the gesture detection result of the hand to be detected.

Further, in the embodiments of the present disclosure, when the hand gesture detection apparatus performs, based on the second feature map, the three-dimensional gesture estimation processing on the hand to be detected by using the gesture estimation model to obtain the gesture detection result of the hand to be detected, the hand gesture detection apparatus directly inputs the second feature map into the gesture estimation model and outputs the gesture detection result corresponding to the hand to be detected.

It should be noted that, in the present disclosure, the hand gesture detection apparatus performs the gesture detection for the hand to be detected based on the RoIAlign feature map of the hand, i.e., the second feature map of the hand to be detected. Specifically, the RoIAlign feature extractor is configured to perform the shallow feature extraction, that is, the second feature map, which is obtained through feature extraction performed on the first feature map by the hand gesture detection apparatus using the RoIAlign feature extractor, cannot represent deep features of the hand to be detected. Therefore, the hand gesture detection apparatus can further use the gesture estimation model to complete the deep feature extraction of the hand to be detected.

It can be understood that, in the embodiments of the present disclosure, the hand gesture detection apparatus uses the target bounding box and the RoIAlign feature extractor to complete the detection of the hand to be detected, and the obtained detection result is the second feature map corresponding to the hand to be detected. Then, the hand gesture detection apparatus can further use the gesture estimation model to complete gesture estimation of the hand to be detected. The hand gesture detection apparatus performs the gesture estimation processing based on the second feature map. That is, the detection result after the detection processing can be the input of the gesture estimation model to complete the gesture estimation processing.

FIG. 10 is a second schematic flowchart of a hand gesture detection method according to an embodiment of the present disclosure. As illustrated in FIG. 10, in the embodiments of the present disclosure, the hand gesture detection method performed by the hand gesture detection apparatus may further include the following actions in blocks.

At block 105, a detection model and an estimation model are built.

At block 106, based on each of a plurality of training images included in a training sample set, model training is performed on the detection model by using a first predetermined loss function, and model training is performed on the estimation model by using a second predetermined loss function.

At block 107, when a loss value of the first predetermined loss function is within a first predetermined interval, a trained detection model is determined as the bounding box detection model.

At block 108, when a loss value of the second predetermined loss function is within a second predetermined interval, a trained estimation model is determined as the gesture estimation model.

In the embodiments of the present disclosure, the hand gesture detection apparatus may first train the bounding box detection model and the gesture estimation model. The bounding box detection model is used to determine a region corresponding to the hand to be detected, and the gesture estimation model is used to extract deep features of the hand to be detected.

Specifically, in the embodiments of the present disclosure, the hand gesture detection apparatus may first build the detection model and the estimation model. The detection model is used to train the bounding box detection model, and the estimation model is used to train the gesture estimation model.

Further, in the present disclosure, based on each training image in the training sample set, the hand gesture detection apparatus may perform the model training on the detection model by using the first predetermined loss function, and perform the model training on the estimation model by using the second predetermined loss function.

It should be noted that, in the embodiments of the present disclosure, the training sample set may include a plurality of training images. The training sample set can be used to train both the bounding box detection model and the gesture estimation model.

Further, in the embodiments of the present disclosure, during the training of the bounding box detection model, if the loss value of the first predetermined loss function is within the first predetermined interval, the trained detection model can be determined as the bounding box detection model.

Further, in the embodiments of the present disclosure, during the training of the gesture estimation model, if the loss value of the second predetermined loss function is within the second predetermined interval, the trained estimation model can be determined as the gesture estimation model.
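A minimal training-loop sketch of blocks 105 to 108 is shown below, assuming PyTorch-style modules. The concrete loss functions, interval bounds, and optimizer are placeholders of our own, since the disclosure only requires that each loss value fall within its predetermined interval.

    import torch

    def train(detection_model, estimation_model, train_loader, det_loss_fn, est_loss_fn,
              det_interval=(0.0, 0.01), est_interval=(0.0, 0.01)):
        """Jointly train the detection and estimation models (blocks 105-108)."""
        params = list(detection_model.parameters()) + list(estimation_model.parameters())
        optimizer = torch.optim.Adam(params, lr=1e-4)
        for image, gt_boxes, gt_pose in train_loader:
            boxes, feat = detection_model(image)   # first predetermined loss: boxes
            det_loss = det_loss_fn(boxes, gt_boxes)
            pose = estimation_model(feat, boxes)   # second predetermined loss: 3D gesture
            est_loss = est_loss_fn(pose, gt_pose)
            optimizer.zero_grad()
            (det_loss + est_loss).backward()       # coupled end-to-end update
            optimizer.step()
        # The trained models are accepted once each loss is within its interval.
        return (det_interval[0] <= det_loss.item() <= det_interval[1]
                and est_interval[0] <= est_loss.item() <= est_interval[1])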

In summary, in the embodiments of the present disclosure, through the hand gesture detection method as described in the above blocks 101 to 108, the hand gesture detection apparatus can combine the tasks of the hand detection and the hand gesture estimation. In one aspect, the backbone feature extractor can be used only in the process of hand detection, thereby saving computational cost. In another aspect, the training and inference are consistent, that is, the training and inference are applied on the same bounding box, without requiring an adjustment of the bounding box. In yet another aspect, the hand gesture detection apparatus can use more training samples to perform the hand gesture detection, thereby improving the accuracy.

FIG. 11 is a schematic diagram of an architecture of the hand gesture estimation method according to an embodiment of the disclosure. As illustrated in FIG. 11, when the hand gesture detection apparatus performs hand gesture detection, it can couple the hand detection (A) with the hand gesture estimation (B). Specifically, the hand gesture detection apparatus can perform feature extraction on an initial depth image through a backbone feature extractor to obtain a first feature map (A1), while it can detect a target (hand) through a bounding box detection model to further determine initial bounding boxes (A2). After a target bounding box is obtained through a bounding box selection processing (A3), that is, after the bounding box with the highest confidence is selected, the hand gesture detection apparatus can further perform a shallow feature extraction on the first feature map (A4) using the target bounding box and an RoIAlign feature extractor to obtain a hand RoIAlign feature map, i.e., a second feature map. Then, based on the second feature map, the hand gesture detection apparatus can continue to process the next task, i.e., the task of performing the hand gesture estimation (B). Specifically, the hand gesture detection apparatus may input the second feature map into the gesture estimation model (B1), and finally obtain a 3D hand gesture estimation result of the hand to be detected.

In this regard, the hand gesture detection method proposed in the embodiments of the present disclosure can realize cascaded hand detection and hand gesture estimation and can combine the hand detection and the hand gesture estimation end-to-end. That is, in the training and detection, the hand gesture detection apparatus uses the RoIAlign feature extractor to connect the output of the hand detection to the input of the hand gesture estimation.

Further, in the embodiments of the present disclosure, the hand gesture detection apparatus can omit the bounding box adjustment, thereby aligning the input for training and inference.

It should be noted that, in the embodiments of the present disclosure, the RoIAlign feature extractor is used to connect the output of the hand detection and the input of the hand gesture estimation. For example, the backbone feature extractor outputs an image feature F with a size of 12×15×256 (height×width×channel), i.e., the first feature map. The bounding box detection model inputs the image feature F and outputs the initial bounding boxes, and obtains the target bounding box B after the selection processing. The RoIAlign feature extractor crops the image feature F, and the cropped region is defined by the target bounding box B. The RoIAlign feature extractor outputs a cropped region-of-interest feature (RoI feature, i.e., the second feature map) with a size of 8×8, and the RoI feature is inputted into the next task, which is used for hand gesture estimation processing.
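Using the sizes given in this example, the crop can be reproduced with torchvision's roi_align; the box coordinates for B below are hypothetical values chosen purely for illustration.

    import torch
    from torchvision.ops import roi_align

    feat = torch.randn(1, 256, 12, 15)               # image feature F: 12x15x256 (NCHW here)
    B = torch.tensor([[0.0, 2.0, 3.0, 10.0, 11.0]])  # (batch_idx, x1, y1, x2, y2), hypothetical
    roi = roi_align(feat, B, output_size=(8, 8))     # bilinear sampling, no quantization
    print(roi.shape)  # torch.Size([1, 256, 8, 8]) -> the 8x8 RoI feature (second feature map)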

It can be understood that, in the present disclosure, before the RoIAlign feature extraction, the hand gesture detection apparatus needs to perform the selection processing of the target bounding box first. Specifically, the hand gesture detection apparatus may first select the optimal candidate bounding boxes for training, and then select the optimal candidate bounding box for inference. After NMS is applied, the bounding boxes with the higher confidences are determined as the optimal bounding boxes.

Further, in the present disclosure, when selecting bounding boxes in the training process, 32 optimal bounding boxes may be selected from all bounding box detection results. First, NMS is applied to 1500 bounding box detection results, and 800 bounding boxes with higher confidences are outputted. The 800 bounding boxes are then sampled into 8 bounding boxes: if two bounding boxes have an intersection greater than 0.5, they are down-sampled, and the bounding boxes that also have the top confidence scores are used for training.
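Read this way, the training-time selection is a two-stage filter. The sketch below strings the earlier nms and downsample_boxes helpers together; the stage sizes follow the numbers quoted above, while the exact sampling rule remains our assumption.

    def select_training_boxes(boxes, scores,
                              nms_keep=800, sample_keep=8, inter_threshold=0.5):
        """Two-stage box selection for training (stage sizes as quoted in the text)."""
        # Stage 1: NMS over all (e.g., 1500) detections, keep the top-confidence ones.
        kept = nms(boxes, scores, iou_threshold=inter_threshold)[:nms_keep]
        kept_boxes = [boxes[i] for i in kept]
        kept_scores = [scores[i] for i in kept]
        # Stage 2: down-sample overlapping boxes, keep the top-scoring survivors.
        spare = downsample_boxes(kept_boxes, kept_scores, inter_threshold)[:sample_keep]
        return [kept[i] for i in spare]  # indices into the original detections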

Correspondingly, in the present disclosure, when selecting a bounding box in the inference process, only one bounding box, i.e., the bounding box with the highest confidence, may be output.

In the hand gesture detection method provided by the embodiments of the present disclosure, the hand gesture detection apparatus obtains the initial depth image including a hand to be detected, and performs the detection processing on the initial depth image by using a backbone feature extractor and a bounding box detection model, to obtain initial bounding boxes and a first feature map corresponding to the hand to be detected; based on the initial bounding boxes, the hand gesture detection apparatus determines a target bounding box, which is one of the initial bounding boxes; based on the target bounding box, the hand gesture detection apparatus crops the first feature map using an RoIAlign feature extractor, to obtain the second feature map corresponding to the hand to be detected; based on the second feature map, the hand gesture detection apparatus performs three-dimensional gesture estimation processing on the hand to be detected using a gesture estimation model, to obtain a gesture detection result of the hand to be detected. In other words, in the embodiments of the present disclosure, when the hand gesture detection apparatus performs the hand gesture detection processing, it can combine the two tasks of the hand detection and the hand gesture estimation end-to-end. Specifically, the hand gesture detection apparatus can couple the output result of the hand detection with the input end of the hand gesture estimation through the RoIAlign feature extractor, and it can use the second feature map output by the RoIAlign feature extractor as the input of the gesture estimation model to complete the hand gesture detection. In view of the above, in the hand gesture detection method proposed in the embodiments of the present disclosure, the backbone feature extractor is only used to perform one feature extraction on the initial depth image, thereby achieving the joint processing of the hand detection and the hand gesture estimation. Therefore, the amount of computation can be greatly reduced, and the detection efficiency and accuracy of the hand gesture can be effectively improved.

Based on the above-mentioned embodiments, in yet another embodiment of the present disclosure, FIG. 12 illustrates a third schematic flowchart of a hand gesture detection method according to an embodiment of the present disclosure. As illustrated in FIG. 12, the hand gesture detection method performed by a hand gesture detection apparatus may further include the following actions in blocks.

At block 201, the second feature map is inputted into an image feature extraction network to obtain an image information set feature map corresponding to the first feature map.

In the embodiments of the present disclosure, after the second feature map of the hand is obtained, the hand gesture detection apparatus may input the second feature map into the image feature extraction network to obtain the image information set feature map corresponding to the first feature map.

It should be noted that, in the embodiments of the present disclosure, the second feature map is the extraction of shallow image information such as hand edges and outlines, and the image feature extraction network can extract deep image information such as hand curvature and length.

It can be understood that, after the shallow feature extraction of RoIAlign and the deep feature extraction of the image feature extraction network are performed, all the image information of the hand can be obtained, that is, the image information set feature map corresponding to the first feature map in the embodiment of the present disclosure.

It should be noted that, in the embodiments of the present disclosure, the image feature extraction network includes a first dimensionality reduction network for performing channel reduction on image information, and a deep convolutional network for performing deep feature extraction based on the dimensionality-reduced image information.

Specifically, in order to reduce the amount of computation of processing, the hand gesture detection apparatus can input the second feature map into the first dimensionality reduction network to perform the channel reduction processing on the second feature map through the first dimensionality reduction network, and then it can obtain a first dimensionality-reduced feature map.

The hand gesture detection apparatus can further input the obtained first dimensionality-reduced feature map into the deep convolutional network, to perform deeper image information extraction on the first dimensionality-reduced feature map through the deep convolutional network, and thus it can obtain the image information set feature map.

In an alternative embodiment, in the embodiments of the present disclosure, the deep convolutional network may use an iterative convolutional network in which input and output are superimposed, that is, the input of each layer of the convolutional network is a sum of the input and output of the previous layer of the convolutional network. The same convolutional network can be used for multiple iterative convolution processing, so that the final number of feature maps output through the deep convolutional network is the same as the number of feature maps of the original input. That is to say, the deep convolutional network is only an extraction process of image information without changing the number of image feature maps.

For example, after the hand gesture detection apparatus obtains the second feature map with a size of 8×8×256, the hand gesture detection apparatus can input the 8×8×256 feature map into a 3×3×128 first dimensionality reduction network for channel reduction, so as to obtain an 8×8×128 dimensionality-reduced feature map. The hand gesture detection apparatus can further input the 8×8×128 dimensionality-reduced feature map into a deep convolutional network with four convolution layers, whose inputs and outputs are superimposed, to extract the image information, thereby obtaining an 8×8×128 image information set feature map with the same number of channels as the dimensionality-reduced feature map.
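A plausible PyTorch rendering of this example is given below, assuming that "input and output are superimposed" means residual connections; the layer shapes follow the quoted 8×8×256 to 8×8×128 sizes, while the class name and everything else are illustrative.

    import torch
    import torch.nn as nn

    class ImageFeatureExtraction(nn.Module):
        def __init__(self, in_channels=256, mid_channels=128, num_layers=4):
            super().__init__()
            # First dimensionality reduction network: 3x3 conv reducing 256 -> 128 channels.
            self.reduce = nn.Conv2d(in_channels, mid_channels, kernel_size=3, padding=1)
            # Deep convolutional network: four conv layers with superimposed input/output.
            self.deep = nn.ModuleList(
                nn.Conv2d(mid_channels, mid_channels, kernel_size=3, padding=1)
                for _ in range(num_layers))

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            x = self.reduce(x)            # 8x8x256 -> 8x8x128
            for conv in self.deep:
                x = x + conv(x)           # input of next layer = input + output of this one
            return x                      # image information set feature map, still 8x8x128

    feat = torch.randn(1, 256, 8, 8)      # second feature map
    print(ImageFeatureExtraction()(feat).shape)  # torch.Size([1, 128, 8, 8])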

Further, in the embodiments of the present disclosure, after obtaining the image information set feature map, the hand gesture detection apparatus may further perform up-sampling processing on the image information set feature map.

At block 202, up-sampling processing is performed on the image information set feature map to obtain a target resolution feature map.

In the embodiments of the present disclosure, after obtaining the image information set feature map, the hand gesture detection apparatus may further perform the up-sampling processing on the image information set feature map, in order to obtain the target resolution feature map.

It can be understood that the processes of performing the RoIAlign shallow feature extraction, the first dimensionality reduction processing, and the deep feature extraction processing corresponding to the deep convolutional network on the image are processes for reducing a resolution of an original image. In the embodiments of the present disclosure, the hand gesture detection apparatus may enhance the resolution of the image information set feature map by the up-sampling, i.e., a deconvolution processing, in order to avoid a loss of image information, which may occur in the subsequent depth estimation on the low-resolution feature map.

In an alternative embodiment, the resolution of the image information set feature map can be increased to be the same as the resolution of the initial depth feature map, or the same as the resolution of the first feature map after the bounding box detection, so as to obtain the corresponding target resolution feature map.

For example, assuming that the initial depth image or the first feature map has a size of 16×16×128, the hand gesture detection apparatus needs to perform 2× up-sampling processing on the image information set feature map with a size of 8×8×128, in order to obtain the target resolution feature map with a size of 16×16×128.
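The 2× up-sampling step described here can be sketched with a transposed convolution, one common form of the deconvolution processing mentioned above; the kernel and stride choices below are our own.

    import torch
    import torch.nn as nn

    # Transposed convolution doubling spatial resolution: 8x8x128 -> 16x16x128.
    upsample = nn.ConvTranspose2d(in_channels=128, out_channels=128,
                                  kernel_size=2, stride=2)

    info_map = torch.randn(1, 128, 8, 8)      # image information set feature map
    target_map = upsample(info_map)           # target resolution feature map
    print(target_map.shape)                   # torch.Size([1, 128, 16, 16])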

Further, in the embodiments of the present disclosure, after obtaining the target resolution feature map through the up-sampling processing, the hand gesture detection apparatus may further perform classification processing on a depth interval of the hand key points in the hand depth image based on the target resolution feature map.

At block 203, the target resolution feature map is inputted into a predetermined depth classification network to obtain depth maps corresponding to the hand key points in the first feature map. The predetermined depth classification network is used to distinguish the hand key points with different depths.

In the embodiments of the present disclosure, after obtaining the target resolution feature map, the hand gesture detection apparatus may input the target resolution feature map into the predetermined depth classification network, so as to further obtain the depth map corresponding to each hand key point in the hand depth image.

It can be understood that when the human hand performs a certain gesture action, the corresponding positions, curvatures, and gestures of respective fingers are different. In this regard, when the hand is in a specific position, an interval distance, i.e., a depth interval value, between the same finger of the hand and a position such as the head, chest, or eye of the human body may be different, and the interval values corresponding to different fingers of the hand differ even more. In the embodiments of the present disclosure, the hand gesture detection apparatus may set the positions of the hand key points, and classify each hand key point based on different depth intervals.

Specifically, in the embodiments of the present disclosure, the hand gesture detection apparatus may establish the predetermined depth classification network, and then classify the hand key points based on different depth intervals through the depth classification network. That is, the hand gesture detection apparatus distinguishes the hand key points with different depths through the predetermined depth classification network.

It should be noted that, in the embodiments of the present disclosure, depth maps refer to pictures or channels including distance information of the hand key points, that is, the depth interval values.

Specifically, in the embodiments of the present disclosure, the predetermined depth classification network can set the number of the hand key points and different depth interval reference values. The process of inputting the target resolution feature map obtained after the deep feature extraction and the up-sampling processing into the predetermined depth classification network is a process of roughly predicting the depth interval value of each key point. Then, the hand key points are classified based on the predicted depth interval values to generate the depth maps including the predicted depth interval values corresponding to the hand key points. That is to say, through the predetermined depth classification network, the depth interval values corresponding to N hand key points can be roughly predicted first, and different depth interval values correspond to different depth maps.

In an alternative embodiment, the hand gesture detection apparatus may predefine 20 key points, and after inputting the target resolution feature map into the predetermined depth classification network, the hand gesture detection apparatus can obtain 20 depth maps, which correspond to the 20 key points and include the predicted depth interval values corresponding to the 20 key points.
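One simple way to realize such a head is a 1×1 convolution emitting one depth map per key point; the sketch below is our illustration of that idea, since the disclosure does not specify the layer type.

    import torch
    import torch.nn as nn

    NUM_KEYPOINTS = 20  # one predicted depth map per hand key point

    # Predetermined depth classification head: 128 feature channels -> 20 depth maps.
    depth_head = nn.Conv2d(in_channels=128, out_channels=NUM_KEYPOINTS, kernel_size=1)

    target_map = torch.randn(1, 128, 16, 16)   # target resolution feature map
    depth_maps = depth_head(target_map)        # one 16x16 map per key point
    print(depth_maps.shape)                    # torch.Size([1, 20, 16, 16])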

Further, in the embodiments of the present disclosure, after obtaining the depth maps corresponding to the hand key points, the hand gesture detection apparatus may further determine a real depth value corresponding to each key point based on the depth maps.

At block 204, depth values corresponding to the hand key points are determined based on the depth maps, to realize the hand gesture estimation.

In the embodiments of the present disclosure, after the hand gesture detection apparatus obtains the depth maps corresponding to the hand key points, the hand gesture detection apparatus may determine the depth values corresponding to the hand key points based on the depth maps, and further implement the hand gesture estimation based on the depth values.

It can be understood that, since each depth map includes the depth interval value corresponding to a hand key point, the hand gesture detection apparatus can further determine the depth coordinate of each hand key point based on the depth interval value in the corresponding depth map.

It can be seen that, in the embodiments of the present disclosure, the depth interval value corresponding to each hand key point is roughly predicted, and the hand key points are classified by means of the depth classification, so that the depth value corresponding to each hand key point is determined based on the depth interval value with higher accuracy, thereby achieving an accurate and efficient depth estimation of the hand gesture.

In the hand gesture detection method provided by the embodiments of the present disclosure, the hand gesture detection apparatus obtains the initial depth image including a hand to be detected, and performs the detection processing on the initial depth image by using a backbone feature extractor and a bounding box detection model, to obtain initial bounding boxes and a first feature map corresponding to the hand to be detected; based on the initial bounding boxes, the hand gesture detection apparatus determines a target bounding box, which is one of the initial bounding boxes; based on the target bounding box, the hand gesture detection apparatus crops the first feature map using an RoIAlign feature extractor, to obtain the second feature map corresponding to the hand to be detected; based on the second feature map, the hand gesture detection apparatus performs three-dimensional gesture estimation processing on the hand to be detected using a gesture estimation model, to obtain a gesture detection result of the hand to be detected. In other words, in the embodiments of the present disclosure, when the hand gesture detection apparatus performs the hand gesture detection processing, it can combine the two tasks of the hand detection and the hand gesture estimation end-to-end. Specifically, the hand gesture detection apparatus can couple the output result of the hand detection with the input end of the hand gesture estimation through the RoIAlign feature extractor, and it can use the second feature map output by the RoIAlign feature extractor as the input of the gesture estimation model to complete the hand gesture detection. In view of the above, in the hand gesture detection method proposed in the embodiments of the present disclosure, the backbone feature extractor is only used to perform one feature extraction on the initial depth image, thereby achieving the joint processing of the hand detection and the hand gesture estimation. Therefore, the amount of computation can be greatly reduced, and the detection efficiency and accuracy of the hand gesture can be effectively improved.

Based on the above-mentioned embodiments, in another embodiment of the present disclosure, FIG. 13 is a first schematic diagram of a structural composition of a hand gesture detection apparatus according to an embodiment of the present disclosure. As illustrated in FIG. 13, the hand gesture detection apparatus 10 proposed by the embodiment of the present disclosure may include an obtaining component 11, a detection component 12, a determining component 13, a cropping component 14, an estimation component 15, and a training component 16.

The obtaining component 11 is configured to obtain an initial depth image including a hand to be detected.

The detection component 12 is configured to perform detection processing on the initial depth image by using a backbone feature extractor and a bounding box detection model to obtain initial bounding boxes and a first feature map corresponding to the hand to be detected.

The determining component 13 is configured to determine a target bounding box based on the initial bounding boxes. The target bounding box is one of the initial bounding boxes.

The cropping component 14 is configured to crop, based on the target bounding box, the first feature map by using an RoIAlign feature extractor to obtain a second feature map corresponding to the hand to be detected.

The estimation component 15 is configured to perform, based on the second feature map, a three-dimensional gesture estimation processing on the hand to be detected by using a gesture estimation model to obtain a gesture detection result of the hand to be detected.

Further, in the embodiments of the present disclosure, the detection component 12 is specifically configured to: input the initial depth image into the backbone feature extractor and output the first feature map; and obtain, based on the first feature map, the initial bounding boxes by using the bounding box detection model.

Further, in the embodiments of the present disclosure, the detection component 12 is further specifically configured to: input the first feature map into the bounding box detection model, and output a plurality of bounding boxes and a plurality of confidences corresponding to the plurality of bounding boxes in one-to-one correspondence; and determine, based on the plurality of confidences, a part of the plurality of bounding boxes as the initial bounding boxes.

Further, in the embodiments of the present disclosure, the determining component 13 is specifically configured to determine, as the target bounding box, an initial bounding box corresponding to a maximum confidence among the plurality of confidences corresponding to the initial bounding boxes.

Further, in the embodiments of the present disclosure, the determining component 13 is specifically configured to: determine an intersection parameter between the initial bounding boxes; down-sample the initial bounding boxes to obtain spare bounding boxes when the intersection parameter is greater than a predetermined intersection threshold; and determine, as the target bounding box, a bounding box corresponding to a maximum confidence among the plurality of confidences corresponding to the spare bounding boxes.
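For illustration, the selection logic of the determining component can be sketched with torchvision's standard operators, assuming the intersection parameter is an intersection-over-union (IoU) and the down-sampling is a non-maximum-suppression-style filtering; both readings are assumptions of this sketch rather than the disclosed definitions.

    import torch
    from torchvision.ops import nms, box_iou

    def select_target_box(boxes, scores, iou_threshold=0.5):
        # boxes: (N, 4) initial bounding boxes; scores: (N,) confidences.
        iou = box_iou(boxes, boxes)
        # Zero the diagonal so a box's self-overlap never triggers suppression.
        iou.fill_diagonal_(0)
        if iou.max() > iou_threshold:
            # Down-sample heavily overlapping boxes to the "spare" bounding boxes.
            keep = nms(boxes, scores, iou_threshold)
            boxes, scores = boxes[keep], scores[keep]
        # Target bounding box: the one with the maximum confidence.
        return boxes[scores.argmax()]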

Further, in the embodiments of the present disclosure, the cropping component 14 is specifically configured to input the target bounding box and the first feature map into the RoIAlign feature extractor, and output the second feature map.

Further, in the embodiments of the present disclosure, the cropping component 14 is further specifically configured to determine a cropping region based on the target bounding box; and crop, based on the cropping region, the first feature map by using the RoIAlign feature extractor to obtain the second feature map.
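A minimal usage sketch of the cropping step follows, using torchvision's roi_align operator as a stand-in for the RoIAlign feature extractor. The feature-map size, channel count, output resolution, and box coordinates are illustrative assumptions.

    import torch
    from torchvision.ops import roi_align

    feat = torch.randn(1, 256, 56, 56)                    # first feature map
    target_box = torch.tensor([[12.0, 8.0, 44.0, 52.0]])  # cropping region (x1, y1, x2, y2)
    second_feature_map = roi_align(
        feat, [target_box], output_size=(14, 14),
        spatial_scale=1.0,   # box given in feature-map coordinates here
        aligned=True)        # bilinear sampling without coordinate quantization
    print(second_feature_map.shape)  # torch.Size([1, 256, 14, 14])

Unlike naive cropping, RoIAlign samples the feature map bilinearly at sub-pixel positions, so the second feature map has a fixed resolution regardless of the bounding box size.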

Further, in the embodiments of the present disclosure, the estimation component 15 is specifically configured to input the second feature map into the gesture estimation model, and output the gesture detection result.

Further, in the embodiments of the present disclosure, the training component 16 is configured to: build a detection model and an estimation model; based on each of a plurality of training images included in a training sample set, perform model training on the detection model by using a first predetermined loss function, and perform model training on the estimation model by using a second predetermined loss function; determine, when a loss value of the first predetermined loss function is within a first predetermined interval, a trained detection model as the bounding box detection model; and determine, when a loss value of the second predetermined loss function is within a second predetermined interval, a trained estimation model as the gesture estimation model.
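The training procedure can be sketched as follows. The optimizer, learning rate, loss intervals, and the simplification that the estimation model is trained directly on the training images (rather than on cropped feature maps) are illustrative choices of this sketch, not the disclosed training scheme.

    import torch

    def train(detection_model, estimation_model, loader, det_loss_fn, est_loss_fn,
              det_interval=(0.0, 0.05), est_interval=(0.0, 0.05)):
        opt = torch.optim.Adam(list(detection_model.parameters()) +
                               list(estimation_model.parameters()), lr=1e-4)
        for image, box_gt, pose_gt in loader:  # training sample set (assumed triples)
            det_loss = det_loss_fn(detection_model(image), box_gt)    # first predetermined loss
            est_loss = est_loss_fn(estimation_model(image), pose_gt)  # second predetermined loss
            opt.zero_grad()
            (det_loss + est_loss).backward()
            opt.step()
            done_det = det_interval[0] <= det_loss.item() <= det_interval[1]
            done_est = est_interval[0] <= est_loss.item() <= est_interval[1]
            if done_det and done_est:
                break  # the trained models become the bounding box detection
                       # model and the gesture estimation model, respectively
        return detection_model, estimation_model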

In the embodiments of the present disclosure, FIG. 14 is a second schematic diagram of a structural composition of a hand gesture detection apparatus according to an embodiment of the present disclosure. As illustrated in FIG. 14, the hand gesture detection apparatus 10 proposed by the embodiment of the present disclosure may include a processor 17, and a memory 18 having instructions stored thereon and being executable by the processor 17. Further, the hand gesture detection apparatus 10 can further include a communication interface 19, and a bus 110 for connecting the processor 17, the memory 18 and the communication interface 19.

In the embodiments of the present disclosure, the above-mentioned processor 17 may be at least one of an Application Specific Integrated Circuit (ASIC), a Digital Signal Processor (DSP), a Digital Signal Processing Device (DSPD), a Programmable Logic Device (PLD), a Field Programmable Gate Array (FPGA), a Central Processing Unit (CPU), a controller, or a microcontroller. It can be understood that, for different apparatuses, the electronic device used to implement the above processor function may also be another device, which is not specifically limited in the embodiments of the present disclosure. The hand gesture detection apparatus 10 may include a memory 18, which may be connected to the processor 17. The memory 18 is used to store executable program codes, including computer operating instructions. The memory 18 may include a high-speed RAM, or a non-volatile memory, e.g., at least two disk storages.

In the embodiments of the present disclosure, the bus 110 is used to connect the communication interface 19, the processor 17 and the memory 18, for the mutual communication of these devices.

In the embodiments of the present disclosure, the memory 18 is used for storing instructions and data.

Further, in the embodiments of the present disclosure, the above-mentioned processor 17 is configured to obtain an initial depth image including a hand to be detected, and perform detection processing on the initial depth image by using a backbone feature extractor and a bounding box detection model, to obtain initial bounding boxes and a first feature map corresponding to the hand to be detected; determine a target bounding box based on the initial bounding boxes, the target bounding box being one of the initial bounding boxes; crop, based on the target bounding box, the first feature map by using an RoIAlign feature extractor, to obtain a second feature map corresponding to the hand to be detected; and perform, based on the second feature map, a three-dimensional gesture estimation processing on the hand to be detected by using a gesture estimation model to obtain a gesture detection result of the hand to be detected.

In practical applications, the above-mentioned memory 18 may be a volatile memory, such as a Random-Access Memory (RAM); or a non-volatile memory, such as a Read-Only Memory (ROM), a flash memory, a Hard Disk Drive (HDD) or a Solid-State Drive (SSD); or a combination of the above types of memories, and provides instructions and data to the processor 17.

In addition, in the present embodiment, each functional module may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit. The above-mentioned integrated units can be implemented in the form of hardware, or can be implemented in the form of software function modules.

If the integrated unit is implemented in the form of a software function module and is not sold or used as an independent product, the integrated unit can be stored in a computer-readable storage medium. In this regard, the technical solution of the embodiment, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium, and includes several instructions to enable a computer device (a personal computer, a server, or a network device, etc.) or a processor to execute all or part of the steps of the method in the embodiments. The aforementioned storage medium includes a U disk, a mobile hard disk, an ROM, an RAM, a magnetic disk or an optical disk, and other media that can store program codes.

In the hand gesture detection method provided by the embodiments of the present disclosure, the hand gesture detection apparatus obtains the initial depth image including a hand to be detected, and performs the detection processing on the initial depth image by using a backbone feature extractor and a bounding box detection model, to obtain initial bounding boxes and a first feature map corresponding to the hand to be detected; based on the initial bounding boxes, the hand gesture detection apparatus determines a target bounding box, which is one of the initial bounding boxes; based on the target bounding box, the hand gesture detection apparatus crops the first feature map using an RoIAlign feature extractor, to obtain the second feature map corresponding to the hand to be detected; and based on the second feature map, the hand gesture detection apparatus performs the three-dimensional gesture estimation processing on the hand to be detected using a gesture estimation model, to obtain a gesture detection result of the hand to be detected. In other words, in the embodiments of the present disclosure, when the hand gesture detection apparatus performs the hand gesture detection processing, it can combine the two tasks of the hand detection and the hand gesture estimation end-to-end. Specifically, the hand gesture detection apparatus can couple the output result of the hand detection with the input end of the hand gesture estimation through the RoIAlign feature extractor, and it can use the second feature map output by the RoIAlign feature extractor as the input of the gesture estimation model to complete the hand gesture detection. In view of the above, in the hand gesture detection method proposed in the embodiments of the present disclosure, the backbone feature extractor is used to perform only one feature extraction on the initial depth image, thereby achieving the joint processing of the hand detection and the hand gesture estimation. Therefore, the amount of computation can be greatly reduced, and the detection efficiency and accuracy of the hand gesture can be effectively improved.

An embodiment of the present disclosure provides a computer-readable storage medium, on which a program is stored. When executed by a processor, the program implements the above-described hand gesture detection method.

Specifically, the program instructions corresponding to the hand gesture detection method in the embodiment can be stored on a storage medium such as an optical disc, a hard disk, or a U disk. When the program instructions corresponding to the hand gesture detection method in the storage medium are read and executed by an electronic device, the following actions are performed: obtaining an initial depth image comprising a hand to be detected, and performing detection processing on the initial depth image by using a backbone feature extractor and a bounding box detection model, to obtain initial bounding boxes and a first feature map corresponding to the hand to be detected; determining a target bounding box based on the initial bounding boxes, the target bounding box being one of the initial bounding boxes; cropping, based on the target bounding box, the first feature map by using an RoIAlign feature extractor, to obtain a second feature map corresponding to the hand to be detected; and performing, based on the second feature map, a three-dimensional gesture estimation processing on the hand to be detected by using a gesture estimation model to obtain a gesture detection result of the hand to be detected.

It can be appreciated by those skilled in the art that the embodiments of the present disclosure may be provided as a method, a system, or a computer program product. Accordingly, the present disclosure may take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present disclosure may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, a disk storage, an optical storage, and the like) having computer-usable program codes included therein.

The present disclosure is described with reference to schematic flowcharts and/or block diagrams of implementations of methods, devices (systems), and computer program products according to the embodiments of the present disclosure. It can be understood that each process and/or block in the schematic flowcharts and/or block diagrams, and combinations of processes and/or blocks in the schematic flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor or other programmable data processing device to produce a machine. In this way, the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more processes in the schematic flowcharts and/or one or more blocks in the block diagrams.

These computer program instructions may be stored in a computer-readable memory capable of causing a computer or other programmable data processing devices to function in a particular manner, such that the instructions stored in the computer-readable memory may result in an article of manufacture including instruction means, and the instruction means can implement the functions specified in one or more processes in the schematic flowcharts and/or one or more blocks in the block diagrams.

These computer program instructions can be loaded on a computer or other programmable data processing devices to cause a series of operational steps to be performed on the computer or the other programmable devices to produce a computer-implemented process, such that the instructions executed in the computer or the other programmable data processing devices provide steps for implementing the functions specified in one or more processes in the schematic flowcharts or one or more blocks in the block diagrams. The above are merely the preferable embodiments of the present disclosure, and are not intended to limit the protection scope of the present disclosure.

INDUSTRIAL APPLICABILITY

In the hand gesture detection method and device as well as the computer storage medium provided by the embodiments of the present disclosure, the hand gesture detection apparatus obtains the initial depth image including a hand to be detected, and performs the detection processing on the initial depth image by using a backbone feature extractor and a bounding box detection model, to obtain initial bounding boxes and a first feature map corresponding to the hand to be detected; based on the initial bounding boxes, the hand gesture detection apparatus determines a target bounding box, which is one of the initial bounding boxes; based on the target bounding box, the hand gesture detection apparatus crops the first feature map using an RoIAlign feature extractor, to obtain the second feature map corresponding to the hand to be detected; and based on the second feature map, the hand gesture detection apparatus performs the three-dimensional gesture estimation processing on the hand to be detected using a gesture estimation model, to obtain a gesture detection result of the hand to be detected. In other words, in the embodiments of the present disclosure, when the hand gesture detection apparatus performs the hand gesture detection processing, it can combine the two tasks of the hand detection and the hand gesture estimation end-to-end. Specifically, the hand gesture detection apparatus can couple the output result of the hand detection with the input end of the hand gesture estimation through the RoIAlign feature extractor, and it can use the second feature map output by the RoIAlign feature extractor as the input of the gesture estimation model to complete the hand gesture detection. In view of the above, in the hand gesture detection method proposed in the embodiments of the present disclosure, the backbone feature extractor is used to perform only one feature extraction on the initial depth image, thereby achieving the joint processing of the hand detection and the hand gesture estimation. Therefore, the amount of computation can be greatly reduced, and the detection efficiency and accuracy of the hand gesture can be effectively improved.

What is claimed is:
1. A hand gesture detection method, comprising: obtaining an initial depth image comprising a hand to be detected, and performing detection processing on the initial depth image by using a backbone feature extractor and a bounding box detection model, to obtain initial bounding boxes and a first feature map corresponding to the hand to be detected; determining a target bounding box based on the initial bounding boxes, the target bounding box being one of the initial bounding boxes; cropping, based on the target bounding box, the first feature map by using an RoIAlign feature extractor, to obtain a second feature map corresponding to the hand to be detected; and performing, based on the second feature map, a three-dimensional gesture estimation processing on the hand to be detected by using a gesture estimation model to obtain a gesture detection result of the hand to be detected.

2. The method according to claim 1, wherein said performing the detection processing on the initial depth image by using the backbone feature extractor and the bounding box detection model, to obtain the initial bounding boxes and the first feature map corresponding to the hand to be detected comprises: inputting the initial depth image into the backbone feature extractor, and outputting the first feature map; and obtaining, based on the first feature map, the initial bounding boxes by using the bounding box detection model.
3. The method according to claim 2, wherein said obtaining, based on the first feature map, the initial bounding boxes by using the bounding box detection model comprises: inputting the first feature map into the bounding box detection model, and outputting a plurality of bounding boxes and a plurality of confidences corresponding to the plurality of bounding boxes in one-to-one correspondence; and determining, based on the plurality of confidences, a part of the plurality of bounding boxes in the plurality of bounding boxes as the initial bounding boxes.
4. The method according to claim 2, wherein said determining the target bounding box based on the initial bounding boxes comprises: determining, as the target bounding box, an initial bounding box corresponding to a maximum confidence among the plurality of confidences corresponding to the initial bounding boxes.
5. The method of claim 3, wherein said determining the target bounding box based on the initial bounding boxes comprises: determining an intersection parameter between the initial bounding boxes; down-sampling, when the intersection parameter is greater than a predetermined intersection threshold, the initial bounding boxes to obtain spare bounding boxes; and determining, as the target bounding box, a bounding box corresponding to a maximum confidence among the plurality of confidences corresponding to the spare bounding boxes.
6. The method according to claim 1, wherein said cropping, based on the target bounding box, the first feature map by using the RoIAlign feature extractor to obtain the second feature map corresponding to the hand to be detected comprises: inputting the target bounding box and the first feature map into the RoIAlign feature extractor, and outputting the second feature map.
7. The method according to claim 6, wherein said cropping, based on the target bounding box, the first feature map by using the RoIAlign feature extractor to obtain the second feature map corresponding to the hand to be detected comprises: determining a cropping region based on the target bounding box; and cropping, based on the cropping region, the first feature map by using the RoIAlign feature extractor to obtain the second feature map.
8. The method according to claim 1, wherein said performing, based on the second feature map, the three-dimensional gesture estimation processing on the hand to be detected by using the gesture estimation model to obtain the gesture detection result of the hand to be detected comprises: inputting the second feature map into the gesture estimation model, and outputting the gesture detection result.

9. The method according to claim 1, further comprising: building a detection model and an estimation model; based on each of a plurality of training images comprised in a training sample set, performing model training on the detection model by using a first predetermined loss function, and performing model training on the estimation model by using a second predetermined loss function; determining, when a loss value of the first predetermined loss function is within a first predetermined interval, a trained detection model as the bounding box detection model; and determining, when a loss value of the second predetermined loss function is within a second predetermined interval, a trained estimation model as the gesture estimation model.
10. A hand gesture detection apparatus, comprising: a processor; and a memory having instructions stored thereon and executable by the processor, wherein the instructions, when executed by the processor, implement a hand gesture detection method comprising: obtaining an initial depth image comprising a hand to be detected, and performing detection processing on the initial depth image by using a backbone feature extractor and a bounding box detection model, to obtain initial bounding boxes and a first feature map corresponding to the hand to be detected; determining a target bounding box based on the initial bounding boxes, the target bounding box being one of the initial bounding boxes; cropping, based on the target bounding box, the first feature map by using an RoIAlign feature extractor, to obtain a second feature map corresponding to the hand to be detected; and performing, based on the second feature map, a three-dimensional gesture estimation processing on the hand to be detected by using a gesture estimation model to obtain a gesture detection result of the hand to be detected.
11. The apparatus according to claim 10, wherein the instructions, when executed by the processor, implement: inputting the initial depth image into the backbone feature extractor, and outputting the first feature map; and obtaining, based on the first feature map, the initial bounding boxes by using the bounding box detection model.
12. The apparatus according to claim 11, wherein the instructions, when executed by the processor, implement: inputting the first feature map into the bounding box detection model, and outputting a plurality of bounding boxes and a plurality of confidences corresponding to the plurality of bounding boxes in one-to-one correspondence; and determining, based on the plurality of confidences, a part of the plurality of bounding boxes in the plurality of bounding boxes as the initial bounding boxes.
13. The apparatus according to claim 11, wherein the instructions, when executed by the processor, implement: determining, as the target bounding box, an initial bounding box corresponding to a maximum confidence among the plurality of confidences corresponding to the initial bounding boxes.
14. The apparatus according to claim 13, wherein the instructions, when executed by the processor, implement: determining an intersection parameter between the initial bounding boxes; down-sampling, when the intersection parameter is greater than a predetermined intersection threshold, the initial bounding boxes to obtain spare bounding boxes; and determining, as the target bounding box, a bounding box corresponding to a maximum confidence among the plurality of confidences corresponding to the spare bounding boxes.
15. The apparatus according to claim 10, wherein the instructions, when executed by the processor, implement: inputting the target bounding box and the first feature map into the RoIAlign feature extractor, and outputting the second feature map.
16. The apparatus according to claim 15, wherein the instructions, when executed by the processor, implement: determining a cropping region based on the target bounding box; and cropping, based on the cropping region, the first feature map by using the RoIAlign feature extractor to obtain the second feature map.
17. The apparatus according to claim 10, wherein the instructions, when executed by the processor, implement: inputting the second feature map into the gesture estimation model, and outputting the gesture detection result.
18. The apparatus according to claim 10, wherein the instructions, when executed by the processor, further implement: building a detection model and an estimation model; based on each of a plurality of training images comprised in a training sample set, performing model training on the detection model by using a first predetermined loss function, and performing model training on the estimation model by using a second predetermined loss function; determining, when a loss value of the first predetermined loss function is within a first predetermined interval, a trained detection model as the bounding box detection model; and determining, when a loss value of the second predetermined loss function is within a second predetermined interval, a trained estimation model as the gesture estimation model.

19. A computer storage medium, having a program stored thereon and applied to a hand gesture detection apparatus, wherein the program, when executed by a processor, implements a hand gesture detection method comprising: obtaining an initial depth image comprising a hand to be detected, and performing detection processing on the initial depth image by using a backbone feature extractor and a bounding box detection model, to obtain initial bounding boxes and a first feature map corresponding to the hand to be detected; determining a target bounding box based on the initial bounding boxes, the target bounding box being one of the initial bounding boxes; cropping, based on the target bounding box, the first feature map by using an RoIAlign feature extractor, to obtain a second feature map corresponding to the hand to be detected; and performing, based on the second feature map, a three-dimensional gesture estimation processing on the hand to be detected by using a gesture estimation model to obtain a gesture detection result of the hand to be detected.